Abstract: Traditional relational topic models provide a way to discover the hidden topics in a document network. Many theoretical and practical tasks, such as dimensionality reduction, document clustering and link prediction, benefit from this revealed knowledge. However, existing relational topic models assume that the number of hidden topics is known in advance, which is impractical in many real-world applications. To relax this assumption, we propose a nonparametric relational topic model in this paper. Instead of using fixed-dimensional probability distributions in its generative model, we use stochastic processes. Specifically, each document is assigned a gamma process, which represents the topic interest of this document. Although this method provides an elegant solution, it brings additional challenges when mathematically modeling the inherent network structure of a typical document network, i.e., two spatially closer documents tend to have more similar topics. Furthermore, we require that the topics be shared by all the documents. To resolve these challenges, we use a subsampling strategy to assign each document a different gamma process derived from the global gamma process, and the subsampling probabilities of the documents are given a Markov Random Field constraint that inherits the document network structure. Through the designed posterior inference algorithm, we can discover the hidden topics and their number simultaneously. Experimental results on both synthetic and real-world network datasets demonstrate the model's capability to learn the hidden topics and, more importantly, the number of topics.

Index Terms: Topic models, nonparametric Bayesian learning, gamma process, Markov random field


I. Introduction 

Understanding a corpus is significant for businesses, organizations and individuals; examples include the academic papers of IEEE, the emails in an organization and the webpages previously browsed by a person. One commonly accepted and successful way to understand a corpus is to discover its hidden topics. The revealed hidden topics could improve the services of IEEE, such as the ability to search, browse or visualize academic papers; help an organization understand and resolve the concerns of its employees; and help internet browsers understand the interests of a person and then provide accurate personalized services. Furthermore, there are normally links between the documents in a corpus. A paper citation network is a document network in which the academic papers are linked by their citation relations; an email network is a document network in which the emails are linked by their reply relations; a webpage network is a document network in which webpages are linked by their hyperlinks. Since these links also express the nature of the documents, hidden topic discovery should consider them as well.

Studies that discover hidden topics from a document network using Relational Topic Models (RTM) [6]-[8] have already been successful. Unlike traditional topic models, which focus on mining the hidden topics from a document corpus (without links between documents), the RTM makes the discovered topics inherit the document network structure. The links between documents can be considered as constraints on the hidden topics.

One drawback of existing RTMs is that they are built with fixed-dimensional probability distributions, such as the Dirichlet, multinomial, gamma and Poisson distributions, whose dimensions must be fixed before use. Hence, the number of hidden topics must be specified in advance, and it is normally chosen using domain knowledge. This is difficult and unrealistic in many real-world applications, so RTMs cannot determine the number of topics in a document network.

To overcome this drawback, we propose a Nonparametric Relational Topic (NRT) model in this paper, which removes the necessity of fixing the topic number. Instead of probability distributions, the proposed model adopts stochastic processes. A stochastic process can be simply considered as an 'infinite'-dimensional distribution*. To express the interest of a document in the 'infinite' number of topics, we assign each document a gamma process that has infinitely many components. An additional requirement for the gamma process assignment is that two linked documents should tend to share similar topics. This is a common feature of many real-world applications, and many works have exploited this property. To meet this requirement, we formally define two properties that any relational topic model of a document network should satisfy. First, we use a global gamma process to represent a set of base components that is shared by all documents. This is important because users are not interested in analyzing documents in a database that do not share any common topics. Our model achieves the defined properties through: 1) thinning the global gamma process with document-dependent probabilities; and 2) adding a Markov Random Field constraint to the thinning probabilities to retain the network structure. Finally, we assign each document a gamma process that inherits both the content of the document and the link structure. Two sampling algorithms are designed to learn the proposed model under different conditions. Experiments with document networks show some efficiency in learning the hidden topics and superior performance in learning the number of hidden topics. It is worth noting that, although we use document networks as examples throughout this paper, our work can be applied to other networks with node features.

*We only consider pure-jump processes in this paper. Some continuous processes cannot be simply considered as 'infinite'-dimensional distributions.

The main contributions of this paper are to: 

1) propose a new nonparametric Bayesian model which 
can relax the topic number assumption used in the 
traditional relational topic models; 

2) design two sampling inference algorithms for the proposed model: a truncated version and an exact version.

The rest of the paper is structured as follows. Section II summarizes the related work. The proposed NRT model is presented in Section III, and the detailed derivations of its sampling inference are given in Section IV. Section V presents experimental results on both synthetic and real-world data. Finally, Section VI concludes this study with a discussion of future directions.

II. Related Work 

In this section, we briefly review the work related to this paper. The first part summarizes the literature on relational topic models. The second part summarizes the literature on nonparametric Bayesian learning.

A. Topic Models with Networks

Our work in this paper aims to model data with the network structure as a constraint. Since social networks and citation networks are two explicit and commonly used networks in data mining and machine learning, some extensions of the traditional topic models try to adapt to these networks. For social networks, an Author-Recipient-Topic model was proposed to analyze the categories of roles in social networks based on the relationships of people in the network. A similar task was investigated in a study where the social network structure was inferred from informal chat-room conversations utilizing a topic model. The 'noisy links' and 'popularity bias' of social networks were addressed by properly designed topic models. As an important issue in social network analysis, communities were extracted using a Social Topic Model. The Mixed Membership Stochastic Blockmodel is another way to learn the mixed membership vector (i.e., topic distribution) of each node from a network structure, but it does not consider the content/features of each node. For citation networks, the Relational Topic Model (RTM) was proposed to infer topics, discriminative topics and hierarchical topics from citation networks by introducing a link variable between two linked documents. Unlike RTM, a block was adopted to model the link between two documents [18], [19]. Considering the physical meaning of citation relations, a variable was introduced to indicate whether the content of a citing paper was inherited from the cited paper or not. To keep the document structure, a Markov Random Field (MRF) was combined with a topic model [22]. The communities in citation networks were also investigated.

In summary, existing relational topic models are all inherited from traditional topic models, so the number of topics needs to be fixed. In many real-world situations, it is unrealistic to fix this number in advance. Our work tries to resolve this issue through the nonparametric learning techniques reviewed later in this section.


B. Relational Topic Model 

Since the finite relational topic model is our comparative model, we introduce it here in detail. The corresponding graphical representation is shown in Fig. 1, and the generative process is as follows,

$$
\begin{aligned}
\theta_d &\sim Dirichlet(\alpha) \\
\phi_k &\sim Dirichlet(\beta) \\
z_{d,n} &\sim Discrete(\theta_d) \\
w_{d,n} &\sim Discrete(\phi_{z_{d,n}}) \\
c_{d_i,d_j} &\sim GLM(z_{d_i}, z_{d_j})
\end{aligned}
\tag{1}
$$

where $\theta_d$ is the topic distribution of document $d$, $\phi_k$ is the word distribution of topic $k$, $z_{d,n}$ is the topic index of word $n$ in document $d$, and $w_{d,n}$ is the observed word $n$ in document $d$. All these variables are the same as in the original LDA. The different and significant part is the variable $c$, which denotes the observed document link. This model uses a Generalized Linear Model [24] to model the generation of the document links,

$$
p(c_{d_i,d_j} = 1) \sim GLM(z_{d_i}, z_{d_j}) \tag{2}
$$
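As a rough illustration of the generative process in Eq. (1) and the link model in Eq. (2), the following Python sketch draws a tiny corpus together with its links. The text only states that a GLM is used, so the logistic link on the element-wise product of empirical topic proportions, the coefficients `eta` and `nu`, the Dirichlet hyperparameter values and all function names are assumptions made for this example.

```python
import math
import random


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalising Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [x / total for x in draws]


def sample_discrete(probs, rng):
    """Draw one index with the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_rtm(num_docs, num_topics, vocab_size, doc_len, eta, nu, seed=0):
    """Generate words and links from a finite RTM (illustrative settings)."""
    rng = random.Random(seed)
    # phi_k ~ Dirichlet(beta); beta = 0.5 is an assumed value.
    phi = [sample_dirichlet([0.5] * vocab_size, rng) for _ in range(num_topics)]
    docs, zbar = [], []
    for _ in range(num_docs):
        # theta_d ~ Dirichlet(alpha); alpha = 1 is an assumed value.
        theta = sample_dirichlet([1.0] * num_topics, rng)
        z = [sample_discrete(theta, rng) for _ in range(doc_len)]
        docs.append([sample_discrete(phi[k], rng) for k in z])
        # Empirical topic proportions of the word assignments of this document.
        zbar.append([z.count(k) / doc_len for k in range(num_topics)])
    links = {}
    for i in range(num_docs):
        for j in range(i + 1, num_docs):
            score = nu + sum(e * a * b for e, a, b in zip(eta, zbar[i], zbar[j]))
            p_link = 1.0 / (1.0 + math.exp(-score))  # assumed logistic GLM
            links[(i, j)] = 1 if rng.random() < p_link else 0
    return docs, links
```

Calling `generate_rtm(4, 3, 10, 20, [2.0, 2.0, 2.0], -1.0)` returns four documents of 20 word indices each and a 0/1 link indicator for each of the six document pairs. Note that `num_topics` has to be supplied by the caller, which is exactly the limitation discussed in this subsection.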


One problem with this model is that the number of topics needs to be predefined, and for some real-world applications this is not trivial.

Fig. 1: Finite Relational Topic Model


C. Nonparametric Bayesian Learning 

Nonparametric Bayesian learning is a key approach for learning the number of mixtures in a mixture model (also called the model selection problem). Rather than being predefined, this number is inferred from the data, i.e., we let the data speak.

The traditional elements of probabilistic models are fixed-dimensional distributions, such as the Gaussian distribution, the Dirichlet distribution, the logistic normal distribution and so on. The dimensions of all these distributions need to be predefined. To avoid this, the Gaussian process and the Dirichlet process are used to replace the former fixed-dimensional distributions because of their infinite properties. Since the data are limited, the atoms that are learned/used will also be limited, even with these 'infinite' stochastic processes.

The Dirichlet process can be seen as a distribution over distributions. Since a sample from a Dirichlet process defines a set of variables that follow a Dirichlet distribution, the Dirichlet process is a good alternative for models with a Dirichlet distribution as the prior. There are three different methods to construct this process: the Blackwell-MacQueen urn scheme, the Chinese restaurant process and the stick-breaking process. Although the processes that result from them are all Dirichlet processes, they express different properties of the Dirichlet process: the posterior distribution from the Blackwell-MacQueen urn scheme, the clustering from the Chinese restaurant process, and the formal sampling function from the stick-breaking process. Based on these constructive processes, the Dirichlet process mixture was proposed, which is a kind of infinite mixture model. Infinite mixture models are the extension of finite mixture models, in which a finite number of hidden components (topics) is used to generate data. Another infinite mixture model is the infinite Gaussian mixture model. Normally, a Gaussian mixture model is used for continuous variables and a Dirichlet process mixture is used for discrete variables. An example use of the Dirichlet process is the hierarchical topic model composed of Latent Dirichlet Allocation (LDA) with a nested Chinese restaurant process. By using a nested Chinese restaurant process as the prior, not only is the number of topics not fixed, but the topics in this model are also hierarchically organized. To learn models based on the Dirichlet process mixture with an infinite property, the inference methods must be properly designed. There are two popular and successful methods for this: Markov chain Monte Carlo (MCMC) and variational inference.
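The stick-breaking construction mentioned above can be sketched in a few lines of Python. The sketch assumes the standard form in which each break fraction is drawn from Beta(1, alpha); the function name and the truncation after `num_sticks` breaks are illustrative choices, not part of the cited constructions.

```python
import random


def stick_breaking(alpha, num_sticks, seed=0):
    """Return the first `num_sticks` weights of a stick-breaking draw.

    Also returns the length of the stick that is still unbroken, i.e. the
    total weight of all components beyond the truncation.
    """
    rng = random.Random(seed)
    weights = []
    remaining = 1.0
    for _ in range(num_sticks):
        frac = rng.betavariate(1.0, alpha)  # fraction broken off the rest
        weights.append(remaining * frac)
        remaining *= 1.0 - frac
    return weights, remaining
```

The weights plus the unbroken remainder always sum to one, and the remainder shrinks geometrically in expectation. This is why only a finite number of components carries noticeable weight for a finite dataset, even though the process itself has infinitely many.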

To summarize, nonparametric learning has been successfully used to extend many models and has been applied in many real-world applications. However, there is still no work on the nonparametric extension of relational topic models. This paper uses a set of gamma processes to extend the finite relational topic model to an infinite one.


III. Nonparametric Relational Topic Model 


In this section, we present the proposed Nonparametric Relational Topic (NRT) model in detail. This model removes the need to define the number of topics in advance.

The proposed model uses a gamma process to express the interest of a document in infinitely many hidden topics. A gamma process $GaP(\alpha, H)$ is a stochastic process, where $H$ is a base (shape) measure parameter and $\alpha$ is the concentration (scale) parameter. It also corresponds to a completely random measure. Let $\Gamma = \{(\pi_i, \theta_i)\}_{i=1}^{\infty}$ be a random realization of a gamma process in the product space $\mathbb{R}^{+} \times \Theta$. Then, we have

$$
\Gamma \sim GaP(\alpha, H), \qquad \Gamma = \sum_{i=1}^{\infty} \pi_i \delta_{\theta_i} \tag{3}
$$

where $\delta(\cdot)$ is an indicator function, $\pi_i$ follows an improper gamma distribution $gamma(0, \alpha)$ and $\theta_i \sim H$. When using $\Gamma$ to express the document interest, the $\{\theta_i\}_{i=1}^{\infty}$ in Eq. (3) denote the infinite number of topics and the $\{\pi_i\}_{i=1}^{\infty}$ in Eq. (3) denote the weights of the infinite number of topics in a document.
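A gamma process has infinitely many atoms, so any simulation has to approximate it. The sketch below uses a common finite approximation, which is an assumption of this example rather than the construction used in this paper: with `num_atoms` atoms, each weight is drawn from a gamma distribution with shape `total_mass / num_atoms`, so that the sum of the weights has the same gamma distribution for every choice of `num_atoms`. The names `total_mass` (the mass of the base measure $H$) and `scale`, and the choice of a uniform Dirichlet over a vocabulary as the normalised base measure, are hypothetical.

```python
import random


def truncated_gamma_process(total_mass, scale, num_atoms, vocab_size, seed=0):
    """Draw (weights, atoms) of a finite approximation to a gamma process."""
    rng = random.Random(seed)
    weights, atoms = [], []
    for _ in range(num_atoms):
        # pi_k ~ Gamma(total_mass / num_atoms, scale); most draws are tiny.
        weights.append(rng.gammavariate(total_mass / num_atoms, scale))
        # theta_k ~ normalised base measure, assumed here to be a uniform
        # Dirichlet over the vocabulary (normalised Gamma(1, 1) draws).
        g = [rng.gammavariate(1.0, 1.0) for _ in range(vocab_size)]
        s = sum(g)
        atoms.append([x / s for x in g])
    return weights, atoms
```

With a large `num_atoms`, only a handful of weights are noticeably larger than zero, which mirrors the statement above that a document has weights on infinitely many topics while only a few of them matter.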

As illustrated in Fig. 2, our idea is to assign each document a gamma process. This assignment should satisfy the following two properties:

Property 1. The gamma processes of two linked documents should have similar components/topics.

Property 2. All the gamma processes of the documents should share the same set of components/topics.

Fig. 2: Illustration of gamma process assignments for the document network. Each document is assigned a gamma process that has infinitely many components (represented by the fences in a document in the figure). Each fence denotes a hidden topic, and some examples are shown in the figure. The lengths of the fences denote the weights of the different topics in a document.

To achieve the above properties, we first generate a global gamma process,

$$
\Gamma_0 \sim GaP(\alpha, H)
$$

which is equal to

$$
\Gamma_0 = \sum_{k=1}^{\infty} \pi_k \delta_{\theta_k}, \qquad \theta_k \sim H
$$

where $\{\pi_k, \theta_k\}_{k=1}^{\infty}$ is the shared global set of components/topics for the documents. We then want the components of the gamma process of each document to fall within the set of components/topics of the global gamma process. We use a dependent thinned gamma process to achieve this goal. Its definition is as follows.

Definition 1 (Thinned Gamma Process). Suppose we have a gamma process $\Gamma \sim GaP(\alpha, H)$ with countably infinite points $\{(\pi_i, \theta_i)\}_{i=1}^{\infty}$. Then, we generate a set of independent binary variables $\{r_i\}_{i=1}^{\infty}$ ($r_i \in \{0, 1\}$). The new process,

$$
\Gamma_1 = \sum_{i=1}^{\infty} \pi_i r_i \delta_{\theta_i} \tag{4}
$$

is still a gamma process, as has been proved in the literature. The $\{r_i\}$ can be seen as indicators of whether each point of the original/global gamma process is retained, so $\Gamma_1$ is called a thinned gamma process.

We can give each $r_i$ a Bernoulli prior $p(r_i = 1) = p_i$. Apparently, different realizations of $\{r_i\}$ lead to different gamma processes. Furthermore, dependence between different realizations of $\{r_i\}$ also leads to dependence between the generated gamma processes.

For each document, a thinned gamma process $\Gamma_d$ is generated with $\Gamma_0$ as the global process,

$$
\Gamma_d = \sum_{k=1}^{\infty} \pi_k r_k^d \delta_{\theta_k} \tag{5}
$$

where $\{r_k^d\}_{k=1}^{\infty}$ is a set of indicators of document $d$ on the corresponding components. These $\{r_k^d\}$ are independent random variables with Bernoulli distributions,

$$
r_k^d \sim Bernoulli(q_k^d) \tag{6}
$$

where $q_k^d$ denotes the probability that the gamma process $\Gamma_d$ of document $d$ retains component $k$. Because every $\Gamma_d$ is thinned from the same $\Gamma_0$, Property 2 is achieved.

To make linked documents have similar gamma processes, we define a Subsampling Markov Random Field (MRF) to constrain the $q_k^d$ of all documents.

Definition 2 (Subsampling Markov Random Field). The subsampling probabilities of all the documents on a component/topic in the global gamma process have the following constraint,

$$
p(\{q_k^d\}_{d=1}^{D}) = \frac{1}{Z(q)} \prod_{C \in clique(Network)} \psi(C) = \frac{1}{Z(q)} \exp\Big(-\sum_{\langle d_i, d_j \rangle \in C} \|q_k^{d_i} - q_k^{d_j}\|^2\Big) \tag{7}
$$

where $Network$ is the document network, $\psi(C)$ is the energy function of the MRF and $Z(q)$ is the normalization term, also called the partition function.

Through this subsampling MRF constraint, the marginal distribution of each subsampling probability $q_k^d$ depends on the values of its neighbors. Therefore, the $q_k^d$ of linked documents will be similar, which ensures that the proposed NRT achieves Property 1.

To sum up, the proposed Nonparametric Relational Topic (NRT) model is (writing $q_{d,k}$ for $q_k^d$ and $r_{d,k}$ for $r_k^d$),

$$
\begin{aligned}
\Gamma_0 &\sim GaP(\alpha, H) \quad \text{or} \quad \Gamma_0 = \sum_{k=1}^{\infty} \pi_k \delta_{\theta_k} \\
p(q_{d,k}) &= beta(q_{d,k}; a_0, c_0) \cdot M(q_{d,k} \mid q_{-d,-k}) \\
M(q_{d,k} \mid q_{-d,-k}) &\propto \exp\Big(-\sum_{l \in \varepsilon(q_{d,k})} \|q_{d,k} - q_l\|^2\Big) \\
r_{d,k} &\sim Ber(q_{d,k}) \\
\Gamma_d &= \sum_{k=1}^{\infty} r_{d,k} \pi_k \delta_{\theta_k}
\end{aligned}
$$

where $\varepsilon(q_{d,k})$ is the set of neighbors of $q_{d,k}$ in the MRF.
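The thinning step of Eqs. (5) and (6) is simple to express in code once the global process has been truncated to a finite list of weights. The following Python sketch assumes such a truncation; the function and argument names are illustrative.

```python
import random


def thin_gamma_process(global_weights, keep_probs, seed=0):
    """Thin a (truncated) global gamma process for one document.

    global_weights: the weights pi_k of the global process.
    keep_probs:     the subsampling probabilities q_k of this document.
    Returns the indicators r_k ~ Bernoulli(q_k) and the weights r_k * pi_k.
    """
    if len(global_weights) != len(keep_probs):
        raise ValueError("one subsampling probability per component is required")
    rng = random.Random(seed)
    indicators = [1 if rng.random() < q else 0 for q in keep_probs]
    thinned = [r * w for r, w in zip(indicators, global_weights)]
    return indicators, thinned
```

Every document keeps a subset of the same global atoms, so the topics are shared by construction (Property 2). Similarity between linked documents (Property 1) has to come from the `keep_probs` themselves, which is the role of the MRF constraint.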


Fig. 3: Nonparametric Relational Topic (NRT) model by dependent thinned gamma processes and Markov Random Field (MRF)

With the $\{\Gamma_d\}_{d=1}^{D}$ for all the documents in hand, the generative procedure of the documents is as follows,

$$
\begin{aligned}
\beta_{d,k} &\sim Gamma(b_0, 1) \\
w_{d,n,k} &\sim Poisson(\theta_{k,n} r_{d,k} \pi_k \beta_{d,k}) \\
w_{d,n} = \sum_{k=1}^{\infty} w_{d,n,k} &\sim Poisson\Big(\sum_{k=1}^{\infty} \theta_{k,n} r_{d,k} \pi_k \beta_{d,k}\Big)
\end{aligned}
$$

Considering the relationship between the Poisson distribution and the multinomial distribution, the likelihood part is equal to,

$$
\begin{aligned}
z_{d,n,m} &\sim Discrete\Big(\frac{\theta_{k,n} r_{d,k} \pi_k \beta_{d,k}}{\sum_{k} \theta_{k,n} r_{d,k} \pi_k \beta_{d,k}}\Big), \qquad m \in [1, w_{d,n}] \\
w_{d,n,k} &= \sum_{m} \delta(z_{d,n,m} = k)
\end{aligned}
$$

This form is more convenient for the slice sampling of the model. Note that the $q$'s have not only a beta distribution prior but also an MRF constraint at the same time. We use this constraint only to make the learned $q$'s satisfy the desired property.
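The equivalence used above between independent Poisson counts and a Poisson total that is allocated by a discrete distribution can be checked numerically. The Python sketch below is only an illustration: `rates` stands for the products $\theta_{k,n} r_{d,k} \pi_k \beta_{d,k}$ of one word in one document, and the simple Poisson sampler (Knuth's multiplication method) is an implementation choice of this example that is adequate for small rates only.

```python
import math
import random


def sample_poisson(rate, rng):
    """Knuth's multiplication method; suitable for small rates."""
    limit = math.exp(-rate)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def counts_via_poisson(rates, rng):
    """First form: one independent Poisson count per topic."""
    return [sample_poisson(lam, rng) for lam in rates]


def counts_via_allocation(rates, rng):
    """Second form: Poisson total, then a discrete topic for each token."""
    total = sample_poisson(sum(rates), rng)
    counts = [0] * len(rates)
    for _ in range(total):
        k = rng.choices(range(len(rates)), weights=rates, k=1)[0]
        counts[k] += 1
    return counts
```

Both functions produce per-topic counts with the same distribution. A topic that has been thinned out for a document (rate zero) never receives a token in either form.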


IV. Model Inference 

The inference of the proposed NRT model is to compute the posterior distribution of the latent variables given the data,

$$
p(K, \pi, q, r, \theta, \beta \mid \{w_{d,n}\}, Network)
$$

Here, we use the Gibbs sampling method to obtain samples from this posterior distribution with a truncation (defining a relatively large topic number). We also adopt the slice sampling technique [40] to develop an exact sampler without the truncation.

A. Gibbs Sampling

It is difficult to perform posterior inference under infinite mixtures, and a common workaround in nonparametric Bayesian learning is to use a truncation method. This widely accepted method uses a relatively large $K$ as the (potential) maximum number of topics.

Sampling $q_{d,k}$. Since there are additional constraints on the variables $q$, they do not have a closed-form posterior distribution.

If $r_{d,k} = 1$,

$$
p(q_{d,k} \mid \cdots) \propto q_{d,k}^{a_0} (1 - q_{d,k})^{c_0 - 1} \cdot \exp\Big(-\sum_{l \in \varepsilon(q_{d,k})} \|q_{d,k} - q_l\|^2\Big)
$$

If $r_{d,k} = 0$,

$$
p(q_{d,k} \mid \cdots) \propto q_{d,k}^{a_0 - 1} (1 - q_{d,k})^{c_0} \cdot \exp\Big(-\sum_{l \in \varepsilon(q_{d,k})} \|q_{d,k} - q_l\|^2\Big)
$$

Given this conditional distribution of $q_{d,k}$, we can use the recently developed, efficient A* sampling, because the conditional distribution can be decomposed into two parts: $q_{d,k}^{a_0 + r_{d,k} - 1} (1 - q_{d,k})^{c_0 - r_{d,k}}$ and $\exp\big(-\sum_{l \in \varepsilon(q_{d,k})} \|q_{d,k} - q_l\|^2\big)$. The first part is easily sampled using a beta distribution (proposal distribution), and the second part is a bounded function.

Sampling $r_{d,k}$.

1) $\forall j \neq k,\ r_{d,j} = 0 \Rightarrow r_{d,k} = 1$

2) $\exists n,\ w_{d,n,k} > 0 \Rightarrow r_{d,k} = 1$

3) $\forall n,\ w_{d,n,k} = 0$:

a) if $\forall n,\ u_{d,n,k} = 0$,

$$
p(r_{d,k} = 1) \propto q_{d,k} \cdot \prod_{n} pois(0; \theta_{k,n} \pi_k \beta_{d,k}) \tag{10}
$$

b) if $\forall n,\ u_{d,n,k} = 0$,

$$
p^{(1)}(r_{d,k} = 0) \propto (1 - q_{d,k}) \cdot \prod_{n} pois(0; \theta_{k,n} \pi_k \beta_{d,k}) \tag{11}
$$

c) if $\exists n,\ u_{d,n,k} \neq 0$,

$$
p^{(2)}(r_{d,k} = 0) \propto (1 - q_{d,k}) \cdot \Big(1 - \prod_{n} pois(0; \theta_{k,n} \pi_k \beta_{d,k})\Big) \tag{12}
$$

Finally, we can use a discrete distribution to sample $r$ by,

$$
p(r_{d,k} = 1 \mid \cdots) = \frac{p(r_{d,k} = 1)}{p(r_{d,k} = 1) + p^{(1)}(r_{d,k} = 0) + p^{(2)}(r_{d,k} = 0)} \tag{13}
$$

Sampling $\beta_{d,k}$.

$$
p(\beta_{d,k} \mid \cdots) \propto Gamma\Big(w_{d,\cdot,k} + b_0, \frac{1}{r_{d,k} \pi_k + 1}\Big) \tag{14}
$$

where $w_{d,\cdot,k} = \sum_{n} w_{d,n,k}$.

Sampling $\theta_k$.

$$
p(\theta_k \mid \cdots) \propto Dir(\alpha_0 + w_{\cdot,1,k}, \ldots, \alpha_0 + w_{\cdot,N,k}) \tag{15}
$$

where $w_{\cdot,n,k} = \sum_{d} w_{d,n,k}$.

Sampling $w_{d,n,k}$ (truncated version).

$$
p(w_{d,n,1}, \ldots, w_{d,n,K} \mid \cdots) \propto Mult(w_{d,n}; \xi_{d,n,1}, \ldots, \xi_{d,n,K}) \tag{16}
$$

where

$$
\xi_{d,n,k} = \frac{\theta_{k,n} r_{d,k} \pi_k \beta_{d,k}}{\sum_{k} \theta_{k,n} r_{d,k} \pi_k \beta_{d,k}} \tag{17}
$$

Sampling $\pi_k$ (truncated version).

$$
p(\pi_k \mid \cdots) \propto Gamma\Big(1/K + w_{\cdot,\cdot,k}, \frac{1}{\beta_{\cdot,k} + 1}\Big) \tag{18}
$$

where $w_{\cdot,\cdot,k} = \sum_{d} \sum_{n} w_{d,n,k}$ and $\beta_{\cdot,k} = \sum_{d} \beta_{d,k}$.

The whole sampling algorithm is summarized in Algorithm 1. Note that the $q_{d,k}$ of different $d$ are independent of each other given the other variables, so the updates of $q_{d,k}$ for different $d$ can be implemented in parallel.

Algorithm 1: Truncated Version of Gibbs Sampling for NRT

Input: Net, a document network with content $w_{d,n}$
Output: $K$, $\{\theta_k\}_{k=1}^{K}$, $\{\pi_k\}_{k=1}^{K}$
randomly set initial values for $K$, $\{\theta_k\}_{k=1}^{K}$, $\{\pi_k\}_{k=1}^{K}$;
iter = 1;
while iter < max_iter do
    for each topic k do
        for each document d do
            for each word n of document d do
                Update $w_{d,n,k}$ by Eq. (16);
            end for
            Update $q_{d,k}$ by its conditional distribution (case $r_{d,k} = 1$ or $r_{d,k} = 0$);
            Update $r_{d,k}$ by Eq. (13);
            Update $\beta_{d,k}$ by Eq. (14);
        end for
        Update $\theta_k$ by Eq. (15);
        Update $\pi_k$ by Eq. (18);
    end for
    iter = iter + 1;
end while

B. Slice Sampling

Although truncated methods are commonly accepted in the literature, maintaining a large number of components and their parameters is time and space consuming. An elegant idea (called slice sampling [40]) to resolve this problem is to introduce additional variables to adaptively truncate/select the infinite components.

Sampling $w_{d,n,k}$ (slice sampling version). To do slice sampling, sample the slice variable as,

$$
u_{d,n,m} \sim Unif(0, \zeta_k) \tag{19}
$$

where $\{\zeta_k\}$ is a fixed positive decreasing sequence with $\lim_{k \to \infty} \zeta_k = 0$, and

$$
p(z_{d,n,m} = k \mid \cdots) \propto \xi_{d,n,k} \cdot \frac{\mathbb{1}(u_{d,n,m} \leq \zeta_k)}{\zeta_k}, \qquad w_{d,n,k} = \sum_{m} \delta(z_{d,n,m} = k) \tag{20}
$$

Sampling $\pi_k$ (slice sampling version). The construction of the gamma process ($\Gamma_0 \sim GaP(\alpha, H)$) is,

$$
\Gamma_0 = \sum_{k=1}^{\infty} E_k e^{-T_k} \delta_{\theta_k} \tag{21}
$$

where

$$
E_k \sim Exp\Big(\frac{1}{\alpha}\Big), \quad T_k \sim Gamma\Big(d_k, \frac{1}{\alpha}\Big), \quad \sum_{k=1}^{\infty} \mathbb{1}(d_k = r) \sim poisson(\gamma), \quad \theta_k \sim H, \quad \gamma = \int_{\Omega} H \tag{22}
$$

The prior of $\pi_k$ is,

$$
\pi_k = E_k e^{-T_k} \sim Exp\Big(\frac{1}{\alpha}\Big) \cdot Gamma\Big(d_k, \frac{1}{\alpha}\Big) \tag{23}
$$

and the posterior is,

$$
\pi_k = (E_k, T_k) \propto poiss(data \mid E_k e^{-T_k}) \cdot Exp\Big(E_k \,\Big|\, \frac{1}{\alpha}\Big) \cdot Gamma\Big(T_k \,\Big|\, d_k, \frac{1}{\alpha}\Big) \tag{24}
$$

We can sample from this posterior through two conditional distributions,

$$
\begin{aligned}
p(E_k \mid T_k) &\sim Gamma\Big(E_k \,\Big|\, w_{\cdot,\cdot,k} + 1, \frac{1}{\alpha + \beta_{\cdot,k} \cdot e^{-T_k}}\Big) \\
p(T_k \mid E_k) &\sim Pois(w_{\cdot,\cdot,k} \mid \beta_{\cdot,k} \cdot e^{-T_k} E_k) \cdot Gamma\Big(T_k \,\Big|\, d_k, \frac{1}{\alpha}\Big)
\end{aligned}
\tag{25}
$$

where $w_{\cdot,\cdot,k} = \sum_{d} \sum_{n} w_{d,n,k}$ and $\beta_{\cdot,k} = \sum_{d} \beta_{d,k}$. The conditional distribution for the indicator $d_k$ is,

$$
p(d_k = i \mid \cdots) \propto p(T_k \mid d_k = i) \cdot p(d_k = i \mid \{d_l\}_{l=1}^{k-1}) \tag{26}
$$

The second factor is,

$$
p(d_k = i \mid \{d_l\}_{l=1}^{k-1}) =
\begin{cases}
0 & \text{if } i < d_{k-1} \\
\dfrac{1 - F(C_{k-1} \mid \gamma)}{1 - F(C_{k-1} - 1 \mid \gamma)} & \text{if } i = d_{k-1} \\
\Big(1 - \dfrac{1 - F(C_{k-1} \mid \gamma)}{1 - F(C_{k-1} - 1 \mid \gamma)}\Big) (1 - f(0 \mid \gamma)) f(0 \mid \gamma)^{h-1} & \text{if } i = d_{k-1} + h
\end{cases}
\tag{27}
$$

where $C_k$ is the number of items in the $k$th Poisson process and $C_k \sim Poiss(\gamma)$.

Note that $d_k$, $E_k$ and $T_k$ are additional variables that we introduce. They are not in the original model; they appear only so that sampling can proceed without a truncation level. The whole slice sampling algorithm is summarized in Algorithm 2.

Algorithm 2: Slice Version of Gibbs Sampling for NRT

Input: Net, a document network with content $w_{d,n}$
Output: $K$, $\{\theta_k\}_{k=1}^{K}$, $\{\pi_k\}_{k=1}^{K}$
randomly set initial values for $K$, $\{\theta_k\}_{k=1}^{K}$, $\{\pi_k\}_{k=1}^{K}$;
iter = 1;
while iter < max_iter do
    for each topic k do
        for each document d do
            for each word n of document d do
                Sample slice variable $u_{d,n,k}$ by Eq. (19);
                Update $w_{d,n,k}$ by Eq. (20);
            end for
            Update $q_{d,k}$ by its conditional distribution (case $r_{d,k} = 1$ or $r_{d,k} = 0$);
            Update $r_{d,k}$ by Eq. (13);
            Update $\beta_{d,k}$ by Eq. (14);
        end for
        Update $\theta_k$ by Eq. (15);
        Update $\pi_k$ by Eq. (25);
        Update $d_k$ by Eq. (26);
    end for
    iter = iter + 1;
end while
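The update of $q_{d,k}$ in both algorithms relies on the factorization of its conditional into a beta density and a bounded term. The Python sketch below exploits the same factorization with plain rejection sampling instead of A* sampling, so it is a simpler stand-in and not the sampler used in this paper. It assumes that the beta part is Beta($a_0 + r_{d,k}$, $c_0 + 1 - r_{d,k}$), which follows from a beta($a_0$, $c_0$) prior combined with a Bernoulli likelihood; all names are illustrative.

```python
import math
import random


def sample_q(r_dk, neighbour_qs, a0, c0, rng, max_tries=100000):
    """Draw q_{d,k} from its conditional by rejection sampling.

    Proposal: Beta(a0 + r_dk, c0 + 1 - r_dk), the beta part of the conditional.
    Acceptance probability: exp(-sum_l (q - q_l)^2), the bounded MRF part,
    which always lies in (0, 1].
    """
    shape_a = a0 + r_dk
    shape_b = c0 + (1 - r_dk)
    for _ in range(max_tries):
        q = rng.betavariate(shape_a, shape_b)
        accept = math.exp(-sum((q - ql) ** 2 for ql in neighbour_qs))
        if rng.random() < accept:
            return q
    raise RuntimeError("no proposal was accepted")
```

Because the acceptance probability is largest when `q` is close to the values in `neighbour_qs`, the accepted draws are pulled towards the subsampling probabilities of the linked documents. The acceptance rate drops as the number of neighbors grows, which is one reason a more efficient scheme such as A* sampling is attractive.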

V. Experiments 

In this section, we evaluate the effectiveness of the proposed model in learning the hidden topics from document networks. First, we use a small synthetic dataset to demonstrate the model's ability to recover the number of available topics in the dataset. We then show its usefulness using real-world datasets.

A. Experiments on synthetic data 

Fig. 5: The change of the active topic number during the iterations.

We generated synthetic data to explore the NRT's ability to infer the number of hidden topics from the document network. We chose a set of ground-truth numbers, symbolised by $K$, $D$ and $W$, that refer to the number of topics, documents and keywords respectively. Then, we generate the $K$ global topics from the $W$-dimensional Dirichlet distribution parameterized by $\{\alpha_1, \ldots, \alpha_W\}$ where $\alpha_i = 1\ \forall i$. Next, we generate the document interests in these topics from the $K$-dimensional Dirichlet distribution parameterized by $\{\beta_1, \ldots, \beta_K\}$ where $\beta_i = 1\ \forall i$. Now that we have the topics and the document interests in these topics, we can generate each document as follows. For each document $d$, the length $N_d$ is chosen to be a number no greater than $N$.

For each word of document $d$, we first draw a topic from the document's interest and then draw a word from the selected topic. Finally, we obtain a matrix with documents as rows and words as columns, and each entry of this matrix denotes the frequency of a particular word in a particular document. The next step is to generate the relations between documents. For each pair of documents, we compute the inner product between their topic distributions. To sparsify these relationships, we only retain the ones whose inner product is greater than 0.2.
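The synthetic data procedure described above can be sketched as follows in Python. The lower bound of the document length (half of `max_len`) is an assumption of this example, as are the function names; the uniform Dirichlet parameters and the 0.2 threshold follow the description.

```python
import random


def dirichlet_ones(dim, rng):
    """Draw from a Dirichlet distribution with all parameters equal to 1."""
    g = [rng.gammavariate(1.0, 1.0) for _ in range(dim)]
    s = sum(g)
    return [x / s for x in g]


def make_synthetic(num_topics, num_docs, vocab_size, max_len,
                   threshold=0.2, seed=0):
    """Return (word-count matrix, set of links, document interests)."""
    rng = random.Random(seed)
    topics = [dirichlet_ones(vocab_size, rng) for _ in range(num_topics)]
    interests = [dirichlet_ones(num_topics, rng) for _ in range(num_docs)]
    counts = [[0] * vocab_size for _ in range(num_docs)]
    for d in range(num_docs):
        # Assumed lower bound of the document length: max_len // 2.
        doc_len = rng.randint(max(1, max_len // 2), max_len)
        for _ in range(doc_len):
            k = rng.choices(range(num_topics), weights=interests[d], k=1)[0]
            n = rng.choices(range(vocab_size), weights=topics[k], k=1)[0]
            counts[d][n] += 1
    links = set()
    for i in range(num_docs):
        for j in range(i + 1, num_docs):
            inner = sum(a * b for a, b in zip(interests[i], interests[j]))
            if inner > threshold:  # keep only sufficiently similar pairs
                links.add((i, j))
    return counts, links, interests
```

For example, `make_synthetic(3, 6, 12, 40)` returns a 6 x 12 count matrix and the pairs of documents whose topic distributions have an inner product above 0.2.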

Here, we adjust the values of $K$, $D$ and $W$ to generate a set of synthetic datasets. The distributions of the topic numbers $K$ learned by the proposed algorithms are shown in Fig. 4. The subfigures in the first column are from the truncated version of the NRT in Algorithm 1, and the subfigures in the second column are from the slice version of the NRT in Algorithm 2. In each subfigure, the counts of topic numbers from all the iterations (the maximum iteration number is set to 1,000 with a burn-in of 100) are illustrated as bar charts. Despite the rough initial guess ($K = D \times 10$), we can see that the recovered histogram for $K$ is very close to the ground-truth value, with small variance. The sampled $K$ is plotted across all Gibbs iterations in Fig. 5, which shows that the Markov chain begins to mix well after around 400 samples.


B. Experiments on real-world data 

The real-world datasets used here are: 

• Cora Dataset^1: The Cora dataset consists of 2,708 scientific publications. The citation network consists of 5,429 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 1,433 unique words.

• Citeseer Dataset: The CiteSeer dataset consists of 3,312 scientific publications. The citation network consists of 4,732 links. Each publication in the dataset is also described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 3,703 unique words.

^1 http://linqs.cs.umd.edu/projects/projects/lbc/

TABLE I: Statistics of Datasets

Datasets    # of documents    # of links    # of words
Cora        2,708             5,429         1,433
Citeseer    3,312             4,732         3,703

Fig. 4: The results of NRT (slice version and truncated version) on synthetic data. The left subfigures show the distribution of the active topic number from the slice version; the right subfigures show the distribution of the active topic number from the truncated version. In each subfigure, the ground truth of the topic number is given at the top, and the bars represent the frequencies of each possible active topic number.

For each dataset, we use 5-fold cross validation to evaluate the performance of the proposed model in comparison with the Relational Topic Model. The whole dataset is split equally into five parts. At each stage, the documents in one part are chosen for testing while the remaining four parts are used for training. We used the implementation of RTM from A Fast And Scalable Topic-Modeling Toolbox^2 for comparison.
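The 5-fold protocol described above can be sketched as follows in Python. The random shuffling before the split and the function name are assumptions of this example; the text only states that the dataset is split equally into five parts.

```python
import random


def five_fold_splits(num_docs, num_folds=5, seed=0):
    """Yield (train_ids, test_ids) for each fold of a shuffled split."""
    rng = random.Random(seed)
    order = list(range(num_docs))
    rng.shuffle(order)
    # Fold i takes every num_folds-th document, so fold sizes differ by at most 1.
    folds = [order[i::num_folds] for i in range(num_folds)]
    for held_out in range(num_folds):
        test = sorted(folds[held_out])
        train = sorted(d for f in range(num_folds) if f != held_out
                       for d in folds[f])
        yield train, test
```

For the Cora dataset, `five_fold_splits(2708)` yields five train/test pairs in which every document is used for testing exactly once.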

^2 http://www.ics.uci.edu/asuncion/software/fast.htm#rtm

To quantitatively compare the proposed model with RTM, two evaluation metrics are designed for both real-world datasets: link prediction and word prediction. Link prediction is used to predict the links between test and training documents using the learned topics.

The basic idea is that there will be a link between two documents if they have similar interests in topics. The evaluation equation is,

$$
L_p = \sum_{d_i}^{D_{test}} \sum_{d_j}^{D_{train}} \delta(d_i, d_j) \sum_{n \in d_i} N_n^{d_i} * \log(\cos(T_{d_j}, W_n)) \tag{28}
$$

where $D_{test}$ is the number of test documents, $D_{train}$ is the number of training documents, $N_n^{d}$ is the count of word $n$ in document $d$, and $\delta(d_i, d_j)$ is 1 if there is a link between $d_i$ and $d_j$ and 0 otherwise. $T_d$ denotes the learned topic distribution of a training document $d$. $W_n$ is the topic distribution (a $K$-dimensional vector) of a word $n$, which can be evaluated by,




$$
W_{n,k} = \frac{\theta_{k,n}}{\sum_{l=1}^{K} \theta_{l,n}} \tag{29}
$$

where $\{\theta_k\}_{k=1}^{K}$ are the learned topics. $W_n$ expresses the interest of word $n$ in topics and $T_d$ expresses the interest of training document $d$ in topics, so their inner product is used to evaluate the probability of their link. We do not consider the normalization here since it does not influence the comparison made between two models on the same dataset, i.e., a "max" operator.
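The link prediction score of Eq. (28), together with the word vectors of Eq. (29), can be computed as in the following Python sketch. The data layout is an assumption of this example: `topics[k][n]` holds $\theta_{k,n}$, `test_counts[i]` maps word indices to counts for test document `i`, `train_topic_dists[j]` holds $T_{d_j}$, and `links` lists the observed (test, training) pairs.

```python
import math


def word_topic_vector(topics, n):
    """W_n: the weights of word n across topics, normalised over topics."""
    column = [topic[n] for topic in topics]
    total = sum(column)
    return [x / total for x in column]


def cosine(u, v):
    """Cosine similarity of two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def link_score(test_counts, train_topic_dists, links, topics):
    """Sum of count-weighted log cosine similarities over observed links."""
    total = 0.0
    for i, j in links:  # only linked pairs contribute (delta = 1)
        for n, count in test_counts[i].items():
            w_n = word_topic_vector(topics, n)
            total += count * math.log(cosine(train_topic_dists[j], w_n))
    return total
```

The score is at most zero and approaches zero when the words of each test document point to the same topics as its linked training documents, so larger values indicate better link prediction.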

For word prediction, the basic idea is that a test document has interests in topics similar to those of its linked training documents, and that its words are generated according to these interests. The evaluation equation is,

$$
W_p = \sum_{d_i}^{D_{test}} \sum_{n \in d_i} \sum_{k}^{K} \theta_{k,n} \Big( \frac{1}{|\mathcal{N}(d_i)|} \sum_{d_j}^{D_{train}} \delta(d_i, d_j) T_{d_j,k} \Big) \tag{30}
$$

where $|\mathcal{N}(d_i)|$ is the number of neighbors that document $d_i$ has.

The results on the Cora dataset (5-fold) are shown in Figs. 6-10, and the results on the Citeseer dataset (5-fold) are shown in Figs. 11-15, in which we have compared NRT with RTM under several settings. For clarity, we denote RTM with $K = num$ as "RTMnum". For example, RTM20 means RTM with $K = 20$.

Note that the slice version of NRT in Algorithm 2 is used as the implementation of NRT. The reason is that the slice version is more efficient than the truncated version, because it does not need to keep the (relatively) large number of hidden topics in memory (the initial guess for the number of topics is normally set larger than the number of documents).

We notice that our algorithm mixed better than RTM under some of the $K$ settings and is generally comparable under the rest. As shown in the left subfigures of each group, the likelihood of the NRT model is generally larger than that of RTM under various settings. This means that the proposed model fits or explains the data better than RTM. As in the synthetic case, we also plot the distribution of $K$. We compared our method with RTM in terms of link and word prediction. In terms of word prediction, our algorithm consistently outperforms RTM in every category. In terms of link prediction, NRT's performance is not universally better than RTM's: NRT is less accurate than RTM under some RTM settings. We can see that there is a trend in the link prediction with respect to the topic number in RTM. This trend comes from the evaluation equation, Eq. (28). An RTM with a smaller topic number tends to assign a larger probability to the observed links, which has also been observed in previous work. In the extreme situation ($K = 1$), RTM reaches its best performance on link prediction. The problem is how to choose the hidden topic number for RTM. Take the Cora dataset as an example. The candidates for the topic number lie at least within [1, 2708]. For the proposed NRT model, however, the active topic number is learned automatically from the data (for the Cora dataset, $K$ is around 42). Without any prior domain knowledge, this topic number achieves relatively good results on link prediction considering its large range [1, 2708]. Overall, we argue that in the absence of accurate domain knowledge of the value of $K$, the NRT algorithm allows us to achieve better and more robust performance compared with the current state-of-the-art methods.


VI. Conclusions and future study 

Despite the success of existing relational topic models in discovering hidden topics from document networks, they are based on the assumption that the number of topics can be easily predefined, which is unrealistic for many real-world applications. To relax this assumption, we have presented a nonparametric relational topic model. In our proposed model, stochastic processes replace the fixed-dimensional probability distributions used by existing relational topic models, which are what make it necessary to predefine the number of topics. At the same time, introducing stochastic processes makes model inference more difficult, and we have therefore also presented truncated Gibbs and slice sampling algorithms for the proposed model. Experiments on both synthetic and real-world datasets have demonstrated our method's ability to infer the hidden topics and their number.

In the future, we are interested in making the sampling algorithm scalable to large networks by using new network constraint methods instead of MRFs. Current MRF-based methods do not make the inference efficient enough. We believe that new network constraint methods can avoid this issue.

Fig. 9: Results of NRT and RTM under different settings (K = 20, 30, 40, 50, 60) on the fourth fold of the 5-fold split of the Cora dataset.

Fig. 12: Results of NRT and RTM under different settings (K = 2, 5, 10, 20, 30, 40, 50, 100) on the second fold of the 5-fold split of the Citeseer dataset.